
Allow GCSTarget's as input and GCSFlagTarget's as output for Hadoop jobs #1664

Merged
merged 3 commits into from
May 11, 2016

Conversation

geowurster
Contributor

This PR allows GCSTarget()s to be valid input and GCSFlagTarget()s to be valid output for Hadoop JobTask()s, just like S3Target()s.
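For readers unfamiliar with the code being changed, the shape of the dispatch is roughly as follows. This is a self-contained sketch with stub classes, not the actual luigi source; `resolve_job_paths` and the `TypeError` are hypothetical stand-ins for the checks in `luigi/contrib/hadoop.py`:

```python
# Stub target classes standing in for luigi.contrib.s3 / luigi.contrib.gcs /
# luigi.contrib.hdfs targets, so this example runs without luigi installed.
class Target:
    def __init__(self, path):
        self.path = path

class S3Target(Target): pass
class S3FlagTarget(S3Target): pass
class GCSTarget(Target): pass
class GCSFlagTarget(GCSTarget): pass
class HdfsTarget(Target): pass


def resolve_job_paths(inputs, output, end_job_with_atomic_move_dir=True):
    """Map job targets to URL strings the `hadoop` command accepts.

    Mirrors the idea of this PR: GCS targets are accepted anywhere S3
    targets are, and flag-target output is rejected when an atomic move
    is requested (the clients don't support atomic mv).
    """
    input_paths = []
    for t in inputs:
        if isinstance(t, (S3Target, GCSTarget, HdfsTarget)):
            input_paths.append(t.path)
        else:
            raise TypeError("unsupported input target: %r" % t)

    if end_job_with_atomic_move_dir and isinstance(output, (S3FlagTarget, GCSFlagTarget)):
        raise TypeError("end_job_with_atomic_move_dir is not supported for flag targets")

    return input_paths, output.path
```

The real code performs these checks inside `run_job`; the exception type and function name here are illustrative only.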

Looking at the tests in test/contrib/hadoop_test.py, I don't see the S3 functionality tested anywhere, so this PR does not introduce any new tests, but I am happy to add some with a bit of guidance.

When running a Hadoop job on Google Compute Engine, Google Cloud Storage can be used in place of HDFS, so gs://bucket/path/to/resource URLs are valid in addition to normal HDFS paths.
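The gs:// scheme distinction can be sketched in a few lines (`is_gcs_path` is a hypothetical helper, not part of this PR):

```python
from urllib.parse import urlparse

def is_gcs_path(path):
    """True for Google Cloud Storage URLs such as gs://bucket/key."""
    return urlparse(path).scheme == "gs"

print(is_gcs_path("gs://bucket/path/to/resource"))  # True
print(is_gcs_path("/user/hadoop/output"))           # False
```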

Also, like S3FlagTarget(), an exception is raised if end_job_with_atomic_move_dir=True and the output is a GCSFlagTarget(). I'm not 100% positive this is the correct behavior: when I tried to run a test job with the atomic move, a UserWarning was issued stating the client doesn't support atomic mv, and the job ultimately failed. I'm on the fence about this portion of the PR and don't feel I have enough information to implement it confidently, so I'm happy to remove it, but I don't have time to investigate whether an atomic move is possible on GCS. We found that it was prohibitively slow anyway.

/usr/local/lib/python2.7/dist-packages/luigi/target.py:182: UserWarning: File system HdfsClient client doesn't support atomic mv.
  warnings.warn("File system {} client doesn't support atomic mv.".format(self.__class__.__name__))
DEBUG:luigi-interface:Running file existence check: hadoop fs -stat gs://bucket/scratch/kevin/hadoop-gcs/
Traceback (most recent call last):
  File "gcs_hadoop.py", line 209, in <module>
    task.run()
  File "/usr/local/lib/python2.7/dist-packages/luigi/contrib/hadoop.py", line 707, in run
    self.job_runner().run_job(self)
  File "/usr/local/lib/python2.7/dist-packages/luigi/contrib/hadoop.py", line 552, in run_job
    luigi.contrib.hdfs.HdfsTarget(output_hadoop).move_dir(output_final)
  File "/usr/local/lib/python2.7/dist-packages/luigi/contrib/hdfs/target.py", line 150, in move_dir
    self.fs.rename_dont_move(self.path, path)
  File "/usr/local/lib/python2.7/dist-packages/luigi/contrib/hdfs/abstract_client.py", line 55, in rename_dont_move
    return super(HdfsFileSystem, self).rename_dont_move(path, dest)
  File "/usr/local/lib/python2.7/dist-packages/luigi/target.py", line 184, in rename_dont_move
    raise FileAlreadyExists()
luigi.target.FileAlreadyExists
DEBUG:luigi-interface:Removing directory /tmp/tmp31ZxPX
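The traceback above follows from the generic fallback in luigi/target.py: when a client doesn't override rename_dont_move, the base class warns that atomic mv is unsupported, checks whether the destination exists, and raises. A rough, self-contained sketch of that fallback (stub classes, not the actual luigi source):

```python
import warnings

class FileAlreadyExists(Exception):
    pass

class FileSystem:
    """Minimal stand-in for the generic file-system fallback in luigi/target.py."""

    def exists(self, path):
        raise NotImplementedError

    def rename(self, path, dest):
        raise NotImplementedError

    def rename_dont_move(self, path, dest):
        # Non-atomic fallback: warn, then fail if the destination already exists,
        # since exists() + rename() cannot be made atomic generically.
        warnings.warn("File system {} client doesn't support atomic mv.".format(
            self.__class__.__name__))
        if self.exists(dest):
            raise FileAlreadyExists()
        self.rename(path, dest)


class FakeClient(FileSystem):
    """Toy client whose destination path already exists, as in the log above."""

    def __init__(self):
        self.files = {"gs://bucket/scratch/kevin/hadoop-gcs/": "existing"}

    def exists(self, path):
        return path in self.files

    def rename(self, path, dest):
        self.files[dest] = self.files.pop(path)
```

Calling `FakeClient().rename_dont_move(tmp_path, "gs://bucket/scratch/kevin/hadoop-gcs/")` raises `FileAlreadyExists`, matching the failure in the traceback.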

@mention-bot

By analyzing the blame information on this pull request, we identified @Tarrasch, @mvj3 and @daveFNbuck to be potential reviewers.

geowurster pushed a commit to GlobalFishingWatch/luigi that referenced this pull request Apr 21, 2016
@geowurster
Contributor Author

@Tarrasch @erikbern Is there anything I can do to help move this along?

@Tarrasch
Contributor

Tarrasch commented May 9, 2016

This PR changes a module I don't know much about. Can you run git blame, look for other contributors who have edited these files, and @-mention them to summon them as reviewers of this patch? Once others +1, I can help press "merge".

@geowurster
Contributor Author

@Tarrasch 👍

@mvj3 @daveFNbuck The git blame identifies you as contributors to luigi.contrib.hadoop. Could either of you spare some time to review this PR? It allows GCSTarget()s as input and GCSFlagTarget()s as output for Hadoop streaming jobs, since gs:// paths are recognized by Hadoop on Google Cloud in addition to HDFS paths.

@dchentech
Contributor

@geowurster The patch seems to make sense and doesn't change the main logic, but it should at least pass the Travis CI. After that, I think it deserves to be merged.

@geowurster
Contributor Author

I don't think the failing test is related to the changes.

======================================================================
ERROR: Test copying 20 files from one folder to another
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/spotify/luigi/test/s3_test.py", line 476, in test_copy_dir
    copy_size = s3_client.get_key(s3_dest + str(i)).size
AttributeError: 'NoneType' object has no attribute 'size'
----------------------------------------------------------------------
Ran 1065 tests in 417.341s

@dchentech
Contributor

dchentech commented May 10, 2016

@geowurster You're right. I contributed a small patch to hadoop.py before and ran into a similar situation; it was just a gnarly Python version compatibility problem. Maybe re-run the CI in case it's a temporary build issue, or else spend some time tracking it down (the previous commit didn't break the Travis CI).

@codecov-io

Current coverage is 75.93%

Merging #1664 into master will decrease coverage by 2.68%

@@           master   #1664   diff @@
=====================================
  Files          94      94          
  Lines       10236   10240     +4   
  Methods         0       0          
  Branches        0       0          
=====================================
- Hits         8047    7776   -271   
- Misses       2189    2464   +275   
  Partials        0       0          
  Note: 4 files (not in diff) in luigi/contrib were modified (Misses +275, Hits -275).

Powered by Codecov. Last updated by a3fe817...a92e903

@Tarrasch Tarrasch merged commit f158a75 into spotify:master May 11, 2016